METHOD AND APPARATUS FOR IMPLEMENTING 
TWO ARCHITECTURES IN A CHIP 

1 CROSS-REFERENCE TO RELATED APPLICATION(S) 

2 This application is a continuation application of allowed U.S. application number 

3 09/496,845, filed February 2, 2000, of common title and inventorship. Priority is claimed 

4 from this prior application. 

5 BACKGROUND 

6 Microprocessors exist that implement a reduced instruction set computing (RISC) 

7 instruction set architecture (ISA) and an independent complex instruction set computing 

8 (CISC) ISA by emulating the CISC instruction with instructions native to the RISC 

9 instruction set. Instructions from the CISC ISA are called "macroinstructions." 

10 Instructions from the RISC ISA are called "microinstructions." Existing microprocessors 

11 do not implement these two architectures as efficiently as can be done. Some existing 

12 processors use more global wires routing data to many parts of the chip. This makes chip 

1 3 routing more difficult and less efficient. These techniques also complicate the timing and 

14 the pipeline of the processor. It is desirable to create an efficient means of implementing 

15 both architectures on a single chip, while leveraging existing hardware. In particular, it is 

16 desirable to localize processing and dispatching of the instructions, with minimal impact 

17 on the existing execution engine. 

18 SUMMARY 

19 The present invention is a method for implementing two architectures on a single 

20 chip. The method uses a fetch engine to retrieve instructions. If the instructions are 

21 macroinstructions, then it decodes the macroinstructions into microinstructions, and then 

22 bundles those microinstructions using a bundler, within an emulation engine. The bundles 

23 are issued in parallel and dispatched to the execution engine and contain pre-decode bits 

24 so that the execution engine treats them as microinstructions. Before being transferred to 

25 the execution engine, the instructions may be held in a buffer. The method also selects 

26 between bundled microinstructions from the emulation engine and native 

27 microinstructions coming directly from the fetch engine, by using a multiplexer or other 

28 means. Both native microinstructions and bundled microinstructions may be held in the 

29 buffer. The method also sends additional information to the execution engine. 

30 The present invention is also a computer system having a processor capable of 

31 implementing two architectures. The computer system has a fetch engine to retrieve 

32 instructions, an execution engine to execute the instructions, and an emulation engine to 
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1 decode macroinstructions into microinstructions before their execution. The emulation 

2 engine uses a bundler to bundle microinstructions and other information into groups. 

3 These bundles are delivered to the execution engine in parallel. 

4 DESCRIPTION OF THE DRAWINGS 

5 Figure 1 is a block diagram of a RISC microprocessor. 

6 Figure 2 is a block diagram of a RISC microprocessor showing further details of 

7 processing RISC and CISC instructions. 

8 Figure 3 is a block diagram of an emulation engine. 

9 Figure 4 is a block diagram showing the bundle format. 

10 Figure 5 is a flow chart showing the operation of the bundler. 

1 1 DETAILED DESCRIPTION 

12 A. RISC Microprocessor 

13 In a very simplistic sense, a RISC microprocessor can be divided into two 

14 portions: an instruction-fetch engine and an execution engine. Figure 1 shows a block 

15 diagram of a RISC microprocessor 10, having a fetch engine 20 and an execution engine 

16 40. In some implementations such as the implementation shown in Figure 1, the fetch 

17 engine is separated from the execution engine by a buffer 30. This buffer 30, also referred 

18 to as a queue, can be used to decouple the fetch engine 20 from the execution engine 40. 

19 While the fetch engine 20 writes new, incoming instruction into the buffer 30, the 

20 execution engine 40 reads and retires instructions from the buffer 30 in the same order as 

21 they were written. As long as there is room in the buffer 30, fetch engine 20 can get ahead 

22 of the execution engine 40. Once the buffer 30 fills, the fetch engine 20 must stall and 

23 wait for the execution engine 40 to take an instruction and free up a slot in the buffer 30. 

24 If the buffer 30 is empty, though, it is possible to create a bypass path 50 around the 

25 buffer 30 so that newly fetched instructions may pass from the fetch engine 20 directly to 

26 the execution engine 40 without first being written into the buffer 30. 

27 The present invention emulates a CISC ISA on a RISC machine. Figure 2 shows 

28 a block diagram of a RISC microprocessor 10 after implementation of the present 

29 invention. To implement the present invention, an emulation engine 60 is required to 

30 convert a stream of instruction bytes into a sequence of microinstructions that can be 

31 understood by the RISC execution engine 40. As shown in Figure 2, the emulation engine 

32 60 receives an instruction stream from the fetch engine 20 and delivers the 

33 microinstructions to the execution engine 40. In a preferred embodiment, the present 

34 invention uses a multiplexer 70 to select instructions from either the fetch engine 20 or 
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1 from the emulation engine 60. The multiplexer 70 then places the selected instructions 

2 into the instruction buffer 30. The emulation engine 60 does not have a bypass path 50 

3 around the instruction buffer 30 because adding a bypass path 50 would cause the 

4 machine to operate at a lower frequency even when executing in native (RISC) mode. 

5 When executing in native mode, the fetch engine 20 delivers 32 bytes of 

6 instruction stream to the execution engine 40. Within each 16 bytes, or "bundle," the 

7 RISC ISA defines there to be three 41 -bit instructions and five bits of template 

8 information. In addition, the fetch engine 20 sends other control information, called pre- 

9 decode bits, that it decodes from the 16 bytes of the instruction stream. The predecode 

10 bits are used by the execution engine 40 to help it efficiently distribute the six instructions 

11 to the proper execution units. 

12 When executing in emulation mode, it is necessary for the execution engine 40 to 

13 receive data in exactly the same format as it does in native mode. This allows the vast 

14 majority of the execution engine 40 to be designed only for native mode execution, while 

15 allowing it also to be used when in emulation mode. Thus, the emulation engine 60 must 

16 also deliver 32 bytes of instruction data along with the predecode bits calculated from 

17 those 32 bytes. 

18 As stated above, there are six native mode instructions contained in the 32 bytes 

19 of instruction stream. However, the performance requirements of this machine are such 

20 that in emulation mode, it is sufficient to deliver a maximum of two native mode 

21 instructions per cycle to the execution engine 40. This simplifies the design of the 

22 emulation hardware because of the template encodings and the dependency requirements 

23 between instructions that are imposed by the RISC ISA. By placing only one instruction 

24 and two NOPs together per bundle, the emulation hardware has a much easier job of 

25 adhering to these architectural requirements. 

26 B. The Bundler 

27 As noted above, the present invention operates by the use of a bundler 100. The 

28 bundler 100 is part of the emulation engine 60. Figure 3 is a block diagram showing parts 

29 of an emulation engine 60, having an emulation front end 80 and a bundler 100. The 

30 emulation engine 60 processes a sequence of operations (XUOPs). Between the emulation 

31 front end 80 and the bundler 100 is an XUOP queue 110, also referred to as an XUOP 

32 buffer. Within the emulation front end 80 is a microcode ROM (uROM) 90. The uROM 

33 90 delivers information to the bundler 100. The function of the bundler 100 is to take 

34 XUOPs and other information (including ZUOPs) delivered from the emulation front end 
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1 80 within the emulation engine 60, converts this information into a valid 16-byte bundle 

2 as defined by the RISC ISA, and deliver to the execution engine 40 two 16-byte bundles 

3 and associated pre-decode bits that can be decoded and executed in parallel without 

4 violating any architectural dependencies within the pair of bundles. Figure 4 shows a 

5 bundle format, having three 41 -bit Syllables and a 5-bit template. 

6 The emulation front end 80 is required to deliver the following bits of information 

7 (referred to as "ZUOPs"), in addition to other information not described herein. These 

8 ZUOPs are to be used by the bundler 100 as it creates the two 16-byte bundles. 

9 1. Syllable: 41 -bit instruction that is understood by the execution engine 40. 

10 2. Immediate: 32-bit immediate field that can be used as an operand. 

11 3. Op-Type: 3-bit field specifying which functional units can execute this 

12 type of Syllable. 

13 4. Sub-Type: 3-bit field specifying further information specific to a particular 

14 Op-Type. 

15 5. Bnd-Hint: 2-bit field indicating certain dependency restrictions between 

16 this Syllable, its predecessor and successor Syllables. 

17 6. Reg-Valid: 4-bit field specifying whether each of four separate fields 

18 within in the 41 -bit Syllable contain valid register identifiers. 

19 Figure 5 shows the operation of the bundler 100 in determining how many 

20 XUOPs to issue. The bundler 100 issues either 0, 1, or 2 XUOPs per issue-group. The 

21 bundler 100 attempts to issue two XUOPs at the same time, if possible. This 

22 determination is based on the number of XUOPs in the XUOP queue 110 and on the 

23 application of certain rules, described below. The bundler must first determine how many 

24 entries are in the XUOP queue 1 10, in a determination function 200. If the XUOP queue 

25 110 has no entries, then the bundler 100 outputs nothing, as shown by the no XUOP 

26 output function 210. 

27 If the XUOP queue 110 has one entry, then a determination function 220 

28 determines whether the Bnd-Hint indicates that two XUOPs must be issued in parallel. If 

29 two XUOPs do not need to be issued in parallel, then the one XUOP in the XUOP queue 

30 110 is dispatched into two 16-byte bundles in the one XUOP output function 230. If the 

31 determination function 220 determines that two XUOPs must be issued in parallel, then 

32 the bundler 100 outputs nothing in the no XUOP output function 210. 

33 If the XUOP queue 110 has two entries, then a determination function 240 

34 determines whether the Bnd-Hint indicates that two XUOPs must be issued in parallel. If 
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1 the determination function 240 determines that 2 XUOPs must be issued in parallel, then 

2 two XUOPs are dispatched into two 16-byte bundles in the two XUOP output function 

3 250. If the determination function 240 determines that two XUOPs are not required to be 

4 issued in parallel, then the determination function 260 determines whether any of the 

5 following five rules apply: 

6 1 . A specific bit in a machine specific register is set to restrict dual issue. 

7 2. Both XUOP's are destined for the same execution unit, unless they are 

8 both floating point operations or if they are both "general" ALU operations. 

9 3. Both XUOP's have a Sub-Type that indicates they modify floating point 

10 (FP)-stack resources. 

11 4. Both XUOP's have a Sub-Type that indicates they could flush the pipeline 

12 based on a comparison result. 

13 5. Comparing register fields that are indicated to be valid by the RegValid 

14 bits shows that there is a register dependency hazard between two XUOP's. 

15 If none of the five rules apply, then two XUOPs are dispatched into two 16-byte bundles 

16 in the two XUOP output function 250. If any of these five rules do apply, then one XUOP 

17 is dispatched into two 16-byte bundles, in the one XUOP output function 230. 

18 C. Transferring Extra Information 

19 In a preferred embodiment, the present invention also transfers extra information 

20 between the emulation engine 60 and the execution engine 40. In one embodiment, the 

21 32-bit Immediate is transferred from the emulation front end 80 to the execution engine 

22 40. The RISC ISA has a memory, long-immediate, integer template (MLI). In native 

23 mode this template specifies that the third Syllable within the 128-bit bundle is an integer 

24 instruction that operates on a 64-bit Immediate, 23 of which bits are contained in the third 

25 Syllable (I) and 41 of which bits are contained in the second Syllable (L). The execution 

26 engine 40 is designed to interpret the MLI template differently when in emulation mode. 

27 In emulation mode the third Syllable contains an integer instruction that operates on a 32- 

28 bit immediate, all of which is contained in the second Syllable. In one embodiment, the 

29 present invention uses the MLI template to send extra information between the emulation 

30 engine 60 and the execution engine 40. 

31 In another embodiment of the present invention, extra information may be sent 

32 between the emulation engine 60 and the execution engine 40 for floating-point 

33 operations. For those operations, the bundler 100 generates an MFI template. The MFI 

34 template specifies that the first syllable within the 128-bit bundle is a memory instruction, 
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1 the second syllable is a floating point instruction, and the third syllable is an integer 

2 instruction. When executing an FP instruction the second syllable contains an FP 

3 instruction, while the first and third syllables contain NOPs. In this case, extra bits of 

4 control information are sent to the execution engine as part of the NOP in the first 

5 Syllable. These bits of NOP Syllable are normally ignored in native mode, but they are 

6 treated specially by the execution engine 40 when in emulation mode. 

7 Although the present invention has been described in detail with reference to 

8 certain embodiments thereof, variations are possible. For example, although the sizes of 

9 certain data, bundles, templates, and other specific information were given by way of 

10 example, these specifics are only by way of illustration. Therefore, the present invention 

1 1 may be embodied in other specific forms without departing from the essential spirit or 

12 attributes thereof. It is desired that the embodiments described herein be considered in all 

13 respects as illustrative, not restrictive, and that reference be made to the appended claims 

14 for determining the scope of the invention. 
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